Review the basic questions we can ask about ASSOCIATION between any two variables:
does it exist?
how strong is it?
what is its direction?
Introduce a widely used analytical tool: REGRESSION
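The three association questions above can each be read off a single correlation test. The sketch below is only an illustration: it uses the built-in mtcars data rather than the lab's datasets, and the 0.05 cut-off is an assumed convention, not something fixed by this lab.

```r
# Illustrative only: built-in mtcars data, not the lab's dataset
data(mtcars)

# Pearson correlation test between car weight (wt) and fuel economy (mpg)
assoc <- stats::cor.test(mtcars$wt, mtcars$mpg, method = "pearson")

# 1. Does it exist? p-value for H0: true correlation equals 0
#    (0.05 is an assumed significance level)
assoc_exists <- assoc$p.value < 0.05

# 2. How strong is it? absolute value of the correlation coefficient r
assoc_strength <- abs(unname(assoc$estimate))

# 3. What is its direction? sign of r
assoc_direction <- ifelse(unname(assoc$estimate) > 0, "positive", "negative")
```

For variables that are not both numeric, other tools would be needed (for instance a chi-squared test for two categorical variables), but the same three questions apply.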
The examples and code from this lab session follow very closely …..:
Topics discussed in Lecture # 4
Shifting the emphasis to empirical prediction
Distinction between supervised & unsupervised algorithms
Unsupervised ML Example
PCA
Useful R resources for metabolomics
Introduction to MetaboAnalyst software
Elements of statistical power analysis
R ENVIRONMENT SET UP & DATA
Needed R Packages
We will use functions from packages base, utils, and stats (pre-installed and pre-loaded)
We will also use the packages below (specifying package::function for clarity).
# Load them for this R session
# General
library(fs) # file/directory interactions
library(here) # tools to find your project's files, based on working directory
here() starts at /Users/luisamimmi/Github/R4biostats
library(paint) # paint data.frames summaries in colour
library(janitor) # tools for examining and cleaning data
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
library(dplyr) # {tidyverse} tools for manipulating and summarizing tidy data
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(forcats) # {tidyverse} tool for handling factors
library(openxlsx) # Read, Write and Edit xlsx Files
library(flextable) # Functions for Tabular Reporting
# Statistics
library(rstatix) # Pipe-Friendly Framework for Basic Statistical Tests
Attaching package: 'rstatix'
The following object is masked from 'package:janitor':
make_clean_names
The following object is masked from 'package:stats':
filter
library(lmtest) # Testing Linear Regression Models
Loading required package: zoo
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
library(broom) # Convert Statistical Objects into Tidy Tibbles
#library(tidymodels) # not installed on this machine
library(performance) # Assessment of Regression Models Performance
# Plotting
library(ggplot2) # Create Elegant Data Visualisations Using the Grammar of Graphics
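The masking messages above are the reason for writing package::function explicitly. As a minimal sketch using only pre-installed packages (the vector x and the data frame adults are made-up illustrations), the call below always reaches the time-series filter in stats, regardless of whether dplyr or rstatix has been attached afterwards:

```r
x <- c(2, 4, 6, 8, 10)

# Explicit namespace: this is always the linear (moving-average) filter
# from {stats}, even when dplyr or rstatix mask the bare name `filter`
moving_avg <- stats::filter(x, filter = rep(1 / 3, 3), sides = 2)

# Hypothetical toy data frame, for illustration only
adults <- data.frame(id = 1:4, age = c(15, 32, 47, 12))

# Base-R row subsetting; with {dplyr} attached, the equivalent would be
# dplyr::filter(adults, age >= 18)
adults_only <- adults[adults$age >= 18, ]
```

Calling a bare filter() after loading the packages above would dispatch to whichever package was attached last, which is easy to get wrong when scripts are reordered.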
DATASETS for today
We will use examples (with adapted datasets) from real clinical studies, provided among the learning materials of the open access books:
Importing Dataset 1 (NHANES)
Name: NHANES (National Health and Nutrition Examination Survey) combines interviews and physical examinations to assess the health and nutritional status of adults and children in the United States. Started in the 1960s, it became a continuous program in 1999.
Documentation: dataset1
Sampling details: Here we use a sample of 500 adults from NHANES 2009-2010 & 2011-2012 (nhanes.samp.adult.500 in the R oibiostat package, which has been adjusted so that it can be viewed as a random sample of the US population)
Adapt the here() call to match your own folder structure
NHANES Variables and their description
[EXCERPT: see complete file in Input Data Folder]
MACHINE LEARNING: A FOCUS ON PREDICTION
…
…
…
…
Splitting the dataset into training and testing samples
Julia Silge https://supervised-ml-course.netlify.app/
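A minimal base-R sketch of a training/testing split follows. It is not the code from the course linked above: the built-in mtcars data stands in for the lab's dataset, and the 80/20 ratio, the seed, and the model mpg ~ wt are all assumptions made for illustration.

```r
set.seed(123)  # arbitrary seed, only so the split is reproducible

# Stand-in data: built-in mtcars (32 rows), not the lab's dataset
n <- nrow(mtcars)

# Randomly pick 80% of the row indices for training (assumed ratio)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))

train_set <- mtcars[train_idx, ]
test_set  <- mtcars[-train_idx, ]

# Fit the model on the training sample only
fit <- stats::lm(mpg ~ wt, data = train_set)

# Evaluate on unseen data: root mean squared error on the testing sample
pred <- stats::predict(fit, newdata = test_set)
rmse_test <- sqrt(mean((test_set$mpg - pred)^2))
```

The key design point is that the testing rows never enter the fit, so rmse_test estimates how the model performs on new observations. Where tidymodels is installed, rsample::initial_split() offers the same idea with stratification options.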
…
…
…
…
_______
ML WITH SUPERVISED ALGORITHMS
…
PCA: step by step (example)
PCA done by hand. PCA step by step as in Statology, but with the Lecture dataset nmr_bins…csv
The result will probably not be exactly the same, because MetaboAnalyst applies normalization and scaling while Statology applies scaling only. That is fine: it lets us see the difference
If there is no time, or it cannot be done, the alternative is to let the students work with MetaboAnalyst in the exercises as well, hoping that both the network and the platform hold up
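The sketch below walks through PCA by hand and then cross-checks it against stats::prcomp(). It is an assumption-laden stand-in: it uses the built-in USArrests data rather than the lecture's NMR bins file, and it applies centering and scaling only (no sample normalization step), so it would not be expected to reproduce MetaboAnalyst output exactly.

```r
# Stand-in data: built-in USArrests (50 rows, 4 numeric variables)
# Step 1: center each variable and scale it to unit variance
X <- scale(USArrests, center = TRUE, scale = TRUE)

# Step 2: covariance matrix of the scaled data (equals the correlation matrix)
S <- stats::cov(X)

# Step 3: eigen-decomposition; eigenvectors are the principal axes,
# eigenvalues are the variances along them (returned in decreasing order)
eig <- eigen(S)
loadings_manual <- eig$vectors

# Step 4: project the scaled data onto the principal axes to get the scores
scores_manual <- X %*% loadings_manual

# Step 5: proportion of total variance explained by each component
var_explained <- eig$values / sum(eig$values)

# Cross-check against the built-in implementation
pca <- stats::prcomp(USArrests, center = TRUE, scale. = TRUE)
```

The signs of individual components may differ between the manual result and prcomp(), because an eigenvector is only defined up to its sign; comparing absolute values of the loadings avoids that ambiguity.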
…
…
…
…
_______
SAMPLE SIZE… 🙀 a.k.a. “the 1,000,000 $ question”!
_______
…
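As a minimal sketch of a sample size calculation, the block below uses stats::power.t.test() for a two-sample t-test. Every planning value (difference to detect, standard deviation, significance level, target power) is an assumption chosen for illustration, not a recommendation from this lab.

```r
# Assumed planning values, for illustration only
effect_delta <- 5     # smallest difference in means worth detecting
outcome_sd   <- 10    # assumed standard deviation of the outcome
alpha        <- 0.05  # significance level
target_power <- 0.80  # desired probability of detecting the difference

# Leave n unspecified so that power.t.test() solves for it
ss <- stats::power.t.test(delta = effect_delta, sd = outcome_sd,
                          sig.level = alpha, power = target_power,
                          type = "two.sample", alternative = "two.sided")

# Round up: a fraction of a participant cannot be recruited
n_per_group <- ceiling(ss$n)

# Reverse check: power actually achieved with the rounded sample size
achieved <- stats::power.t.test(n = n_per_group, delta = effect_delta,
                                sd = outcome_sd, sig.level = alpha,
                                type = "two.sample",
                                alternative = "two.sided")$power
```

Exactly one of n, delta, sd, power, and sig.level must be left out of the call; the function solves for the missing one, which makes it easy to explore how the required sample size reacts to a smaller expected effect.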
Final thoughts/recommendations
The analyses proposed in this Lab are very similar to the process we go through in real life. The following steps are always included:
Thorough understanding of the input data and the data collection process
Bivariate analysis of correlation / association to form an intuition of which explanatory variable(s) may or may not affect the response variable
Diagnostic plots to verify if the necessary assumptions are met for a linear model to be suitable
Upon verifying the assumptions, we fit the hypothesized (linear) model to the data
Assessment of the model performance (\(R^2\), adjusted \(R^2\), \(F\)-statistic, etc.)
As we saw with hypothesis testing, the assumptions we make (and require) for regression are of the utmost importance
Clearly, we only scratched the surface in terms of all the possible predictive models, but we got the hang of the fundamental steps and of some useful tools that may also serve us in more advanced analyses
e.g. broom (within tidymodels), performance, rstatix, lmtest
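The steps above can be sketched end to end with base R and stats only. This is an illustration under stated assumptions: the built-in mtcars data and the model mpg ~ wt + hp are stand-ins, not the lab's data or model. Residual-based diagnostics need a fitted model, so the sketch fits first and checks the assumptions right after.

```r
# Stand-in data: built-in mtcars; variables chosen only for illustration
vars <- c("mpg", "wt", "hp")

# 1. Understand the input data
str(mtcars[, vars])

# 2. Bivariate association between the response and each candidate predictor
cor_mat <- stats::cor(mtcars[, vars])

# 3. Fit the hypothesized linear model
fit <- stats::lm(mpg ~ wt + hp, data = mtcars)

# 4. Diagnostics: standard residual plots (interactive sessions only)
#    and a Shapiro-Wilk test on the residuals
if (interactive()) {
  op <- par(mfrow = c(2, 2))
  plot(fit)
  par(op)
}
resid_normality <- stats::shapiro.test(stats::residuals(fit))

# 5. Model performance: R^2, adjusted R^2, and F-statistic
s <- summary(fit)
perf <- c(r2     = s$r.squared,
          adj_r2 = s$adj.r.squared,
          f_stat = unname(s$fstatistic["value"]))
```

With the packages loaded for this lab, broom::glance(fit) returns the same performance statistics as a one-row tibble, and performance::check_model(fit) draws a panel of assumption checks.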